Using syllable-based indexing features and language models to improve German spoken document retrieval
Authors
Abstract
Spoken document collections with high word-type/word-token ratios and heterogeneous audio remain a challenge for information retrieval. The experimental results reported in this paper demonstrate that syllable-based indexing features can outperform word-based indexing features on such a domain, and that syllable-based speech recognition language models can successfully be used to generate syllable-based indexing features. Recognition is carried out with a 5k syllable language model and a 10k mixed-unit language model whose vocabulary consists of a mixture of words and syllables. Both language models yield retrieval performance comparable to that attained with a large-vocabulary word-based language model. Experiments are performed on a spoken document collection consisting of short German-language radio documentaries. First, the vector space model is applied to a known-item retrieval task and a similar-document search. Then, the known-item retrieval task is further explored with a Levenshtein-distance-based fuzzy word match.
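As a rough illustration of the Levenshtein-distance-based fuzzy word match mentioned in the abstract, the following Python sketch computes the edit distance between two sequences and keeps candidates within a normalized distance threshold. This is not the authors' implementation: the function names, the `max_ratio` threshold, and the example words are assumptions chosen for illustration only.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of syllables)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match(query, candidates, max_ratio=0.34):
    """Return (candidate, distance) pairs whose length-normalized distance
    to the query is at most max_ratio (a hypothetical threshold)."""
    hits = []
    for cand in candidates:
        dist = levenshtein(query, cand)
        norm = dist / max(len(query), len(cand), 1)
        if norm <= max_ratio:
            hits.append((cand, dist))
    return sorted(hits, key=lambda h: h[1])


if __name__ == "__main__":
    # Hypothetical index terms; a misrecognized word can still match the query.
    print(fuzzy_match("bundestag", ["bundestag", "bundesrat", "radio"]))
```

Because `levenshtein` compares elements of arbitrary sequences, the same function can be applied to lists of syllables instead of character strings, which is one plausible way to combine fuzzy matching with syllable-based indexing features.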
Similar resources
Improved Chinese spoken document retrieval with hybrid modeling and data-driven indexing features
Different models retrieve the documents based on different approaches of extracting the underlying content. Different levels of indexing features also offer different functionalities and discriminabilities when retrieving the documents. In this paper, we present results for Chinese spoken document retrieval with hybrid models to integrate the knowledge obtainable from three basic retrieval mode...
Retrieval of Mandarin broadcast news using spoken queries
Considering the monosyllabic structure of the Chinese language, a whole class of indexing features for retrieval of Mandarin broadcast news using syllable-level statistical characteristics has been previously investigated. This paper presents the improvements achieved over the previous results. The major differences are: (1) Multi-scale character- and word-level indexing terms have been integrate...
Multi-scale and Multi-model Integratio in Chinese Spoken Docume
This paper describes our attempt to combine the relative merits of different indexing units (scales) and different retrieval models to improve performance in Chinese spoken document retrieval. Our study includes indexing units from three scales: words, character bigrams and syllable bigrams. We also include two different retrieval models: the HMM-based model and the vector space model (VSM). Ou...
An HMM/n-gram-based linguistic processing approach for Mandarin spoken document retrieval
In this paper, an HMM/N-gram-based linguistic processing approach for Mandarin spoken document retrieval is presented. The underlying characteristics and different structures of this approach were extensively investigated. The retrieval capabilities were verified by tests with indexing features at the word and syllable (subword) levels and by comparison with the conventional vector space model approach. ...
Publication date: 2003